{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "markdown"
    }
   },
   "source": [
    "# Contextual Chunk Headers (CCH) in Simple RAG\n",
    "\n",
    "Retrieval-Augmented Generation (RAG) improves the factual accuracy of language models by retrieving relevant external knowledge before generating a response. However, standard chunking often loses important context, making retrieval less effective.\n",
    "\n",
    "Contextual Chunk Headers (CCH) enhance RAG by prepending high-level context (like document titles or section headers) to each chunk before embedding them. This improves retrieval quality and prevents out-of-context responses.\n",
    "\n",
    "## Steps in this Notebook:\n",
    "\n",
    "1. **Data Ingestion**: Load and preprocess the text data.\n",
    "2. **Chunking with Contextual Headers**: Extract section titles and prepend them to chunks.\n",
    "3. **Embedding Creation**: Convert context-enhanced chunks into numerical representations.\n",
    "4. **Semantic Search**: Retrieve relevant chunks based on a user query.\n",
    "5. **Response Generation**: Use a language model to generate a response from retrieved text.\n",
    "6. **Evaluation**: Assess response accuracy using a scoring system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the Environment\n",
    "We begin by importing necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import json\n",
    "from openai import OpenAI\n",
    "import fitz\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Text and Identifying Section Headers\n",
    "We extract text from a PDF while also identifying section titles (potential headers for chunks)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
    "    Extracts text from a PDF file and prints the first `num_chars` characters.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "    str: Extracted text from the PDF.\n",
    "    \"\"\"\n",
    "    # Open the PDF file\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # Initialize an empty string to store the extracted text\n",
    "\n",
    "    # Iterate through each page in the PDF\n",
    "    for page_num in range(mypdf.page_count):\n",
    "        page = mypdf[page_num]  # Get the page\n",
    "        text = page.get_text(\"text\")  # Extract text from the page\n",
    "        all_text += text  # Append the extracted text to the all_text string\n",
    "\n",
    "    return all_text  # Return the extracted text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the OpenAI API Client\n",
    "We initialize the OpenAI client to generate embeddings and responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the OpenAI client with the base URL and API key\n",
    "client = OpenAI(\n",
    "    base_url=\"https://api.studio.nebius.com/v1/\",\n",
    "    api_key=os.getenv(\"OPENAI_API_KEY\")  # Retrieve the API key from environment variables\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking Text with Contextual Headers\n",
    "To improve retrieval, we generate descriptive headers for each chunk using an LLM model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_chunk_header(chunk, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a title/header for a given text chunk using an LLM.\n",
    "\n",
    "    Args:\n",
    "    chunk (str): The text chunk to summarize as a header.\n",
    "    model (str): The model to be used for generating the header. Default is \"meta-llama/Llama-3.2-3B-Instruct\".\n",
    "\n",
    "    Returns:\n",
    "    str: Generated header/title.\n",
    "    \"\"\"\n",
    "    # Define the system prompt to guide the AI's behavior\n",
    "    system_prompt = \"Generate a concise and informative title for the given text.\"\n",
    "    \n",
    "    # Generate a response from the AI model based on the system prompt and text chunk\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": chunk}\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    # Return the generated header/title, stripping any leading/trailing whitespace\n",
    "    return response.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_text_with_headers(text, n, overlap):\n",
    "    \"\"\"\n",
    "    Chunks text into smaller segments and generates headers.\n",
    "\n",
    "    Args:\n",
    "    text (str): The full text to be chunked.\n",
    "    n (int): The chunk size in characters.\n",
    "    overlap (int): Overlapping characters between chunks.\n",
    "\n",
    "    Returns:\n",
    "    List[dict]: A list of dictionaries with 'header' and 'text' keys.\n",
    "    \"\"\"\n",
    "    chunks = []  # Initialize an empty list to store chunks\n",
    "\n",
    "    # Iterate through the text with the specified chunk size and overlap\n",
    "    for i in range(0, len(text), n - overlap):\n",
    "        chunk = text[i:i + n]  # Extract a chunk of text\n",
    "        header = generate_chunk_header(chunk)  # Generate a header for the chunk using LLM\n",
    "        chunks.append({\"header\": header, \"text\": chunk})  # Append the header and chunk to the list\n",
    "\n",
    "    return chunks  # Return the list of chunks with headers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting and Chunking Text from a PDF File\n",
    "Now, we load the PDF, extract text, and split it into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample Chunk:\n",
      "Header: \"Introduction to Artificial Intelligence: Understanding the Foundations and Evolution\"\n",
      "Content: Understanding Artificial Intelligence \n",
      "Chapter 1: Introduction to Artificial Intelligence \n",
      "Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \n",
      "to perform tasks commonly associated with intelligent beings. The term is frequently applied to \n",
      "the project of developing systems endowed with the intellectual processes characteristic of \n",
      "humans, such as the ability to reason, discover meaning, generalize, or learn from past \n",
      "experience. Over the past few decades, advancements in computing power and data availability \n",
      "have significantly accelerated the development and deployment of AI. \n",
      "Historical Context \n",
      "The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. \n",
      "However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop \n",
      "in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving \n",
      "and symbolic methods. The 1980s saw a rise in exp\n"
     ]
    }
   ],
   "source": [
    "# Define the PDF file path\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Extract text from the PDF file\n",
    "extracted_text = extract_text_from_pdf(pdf_path)\n",
    "\n",
    "# Chunk the extracted text with headers\n",
    "# We use a chunk size of 1000 characters and an overlap of 200 characters\n",
    "text_chunks = chunk_text_with_headers(extracted_text, 1000, 200)\n",
    "\n",
    "# Print a sample chunk with its generated header\n",
    "print(\"Sample Chunk:\")\n",
    "print(\"Header:\", text_chunks[0]['header'])\n",
    "print(\"Content:\", text_chunks[0]['text'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings for Headers and Text\n",
    "We create embeddings for both headers and text to improve retrieval accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(text, model=\"BAAI/bge-en-icl\"):\n",
    "    \"\"\"\n",
    "    Creates embeddings for the given text.\n",
    "\n",
    "    Args:\n",
    "    text (str): The input text to be embedded.\n",
    "    model (str): The embedding model to be used. Default is \"BAAI/bge-en-icl\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The response containing the embedding for the input text.\n",
    "    \"\"\"\n",
    "    # Create embeddings using the specified model and input text\n",
    "    response = client.embeddings.create(\n",
    "        model=model,\n",
    "        input=text\n",
    "    )\n",
    "    # Return the embedding from the response\n",
    "    return response.data[0].embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Generating embeddings: 100%|██████████| 42/42 [02:56<00:00,  4.21s/it]\n"
     ]
    }
   ],
   "source": [
    "# Generate embeddings for each chunk\n",
    "embeddings = []  # Initialize an empty list to store embeddings\n",
    "\n",
    "# Iterate through each text chunk with a progress bar\n",
    "for chunk in tqdm(text_chunks, desc=\"Generating embeddings\"):\n",
    "    # Create an embedding for the chunk's text\n",
    "    text_embedding = create_embeddings(chunk[\"text\"])\n",
    "    # Create an embedding for the chunk's header\n",
    "    header_embedding = create_embeddings(chunk[\"header\"])\n",
    "    # Append the chunk's header, text, and their embeddings to the list\n",
    "    embeddings.append({\"header\": chunk[\"header\"], \"text\": chunk[\"text\"], \"embedding\": text_embedding, \"header_embedding\": header_embedding})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performing Semantic Search\n",
    "We implement cosine similarity to find the most relevant text chunks for a user query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(vec1, vec2):\n",
    "    \"\"\"\n",
    "    Computes cosine similarity between two vectors.\n",
    "\n",
    "    Args:\n",
    "    vec1 (np.ndarray): First vector.\n",
    "    vec2 (np.ndarray): Second vector.\n",
    "\n",
    "    Returns:\n",
    "    float: Cosine similarity score.\n",
    "    \"\"\"\n",
    "    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def semantic_search(query, chunks, k=5):\n",
    "    \"\"\"\n",
    "    Searches for the most relevant chunks based on a query.\n",
    "\n",
    "    Args:\n",
    "    query (str): User query.\n",
    "    chunks (List[dict]): List of text chunks with embeddings.\n",
    "    k (int): Number of top results.\n",
    "\n",
    "    Returns:\n",
    "    List[dict]: Top-k most relevant chunks.\n",
    "    \"\"\"\n",
    "    # Create an embedding for the query\n",
    "    query_embedding = create_embeddings(query)\n",
    "\n",
    "    similarities = []  # Initialize a list to store similarity scores\n",
    "    \n",
    "    # Iterate through each chunk to calculate similarity scores\n",
    "    for chunk in chunks:\n",
    "        # Compute cosine similarity between query embedding and chunk text embedding\n",
    "        sim_text = cosine_similarity(np.array(query_embedding), np.array(chunk[\"embedding\"]))\n",
    "        # Compute cosine similarity between query embedding and chunk header embedding\n",
    "        sim_header = cosine_similarity(np.array(query_embedding), np.array(chunk[\"header_embedding\"]))\n",
    "        # Calculate the average similarity score\n",
    "        avg_similarity = (sim_text + sim_header) / 2\n",
    "        # Append the chunk and its average similarity score to the list\n",
    "        similarities.append((chunk, avg_similarity))\n",
    "\n",
    "    # Sort the chunks based on similarity scores in descending order\n",
    "    similarities.sort(key=lambda x: x[1], reverse=True)\n",
    "    # Return the top-k most relevant chunks\n",
    "    return [x[0] for x in similarities[:k]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running a Query on Extracted Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: What is 'Explainable AI' and why is it considered important?\n",
      "Header 1: \"Building Trust in AI: Addressing Transparency, Explainability, and Accountability\"\n",
      "Content:\n",
      "systems. Explainable AI (XAI) \n",
      "techniques aim to make AI decisions more understandable, enabling users to assess their \n",
      "fairness and accuracy. \n",
      "Privacy and Data Protection \n",
      "AI systems often rely on large amounts of data, raising concerns about privacy and data \n",
      "protection. Ensuring responsible data handling, implementing privacy-preserving techniques, \n",
      "and complying with data protection regulations are crucial. \n",
      "Accountability and Responsibility \n",
      "Establishing accountability and responsibility for AI systems is essential for addressing potential \n",
      "harms and ensuring ethical behavior. This includes defining roles and responsibilities for \n",
      "developers, deployers, and users of AI systems. \n",
      "Chapter 20: Building Trust in AI \n",
      "Transparency and Explainability \n",
      "Transparency and explainability are key to building trust in AI. Making AI systems understandable \n",
      "and providing insights into their decision-making processes helps users assess their reliability \n",
      "and fairness. \n",
      "Robustness and Reliability \n",
      "\n",
      "\n",
      "Header 2: \"Building Trust in AI: Essential Factors for Reliability and Fairness\"\n",
      "Content:\n",
      "to building trust in AI. Making AI systems understandable \n",
      "and providing insights into their decision-making processes helps users assess their reliability \n",
      "and fairness. \n",
      "Robustness and Reliability \n",
      "Ensuring that AI systems are robust and reliable is essential for building trust. This includes \n",
      "testing and validating AI models, monitoring their performance, and addressing potential \n",
      "vulnerabilities. \n",
      "User Control and Agency \n",
      "Empowering users with control over AI systems and providing them with agency in their \n",
      "interactions with AI enhances trust. This includes allowing users to customize AI settings, \n",
      "understand how their data is used, and opt out of AI-driven features. \n",
      "Ethical Design and Development \n",
      "Incorporating ethical considerations into the design and development of AI systems is crucial for \n",
      "building trust. This includes conducting ethical impact assessments, engaging stakeholders, and \n",
      "adhering to ethical guidelines and standards. \n",
      "Public Engagement and Education \n",
      "Engaging th\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Load validation data\n",
    "with open('data/val.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "query = data[0]['question']\n",
    "\n",
    "# Retrieve the top 2 most relevant text chunks\n",
    "top_chunks = semantic_search(query, embeddings, k=2)\n",
    "\n",
    "# Print the results\n",
    "print(\"Query:\", query)\n",
    "for i, chunk in enumerate(top_chunks):\n",
    "    print(f\"Header {i+1}: {chunk['header']}\")\n",
    "    print(f\"Content:\\n{chunk['text']}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Response Based on Retrieved Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the system prompt for the AI assistant\n",
    "system_prompt = \"You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'\"\n",
    "\n",
    "def generate_response(system_prompt, user_message, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a response from the AI model based on the system prompt and user message.\n",
    "\n",
    "    Args:\n",
    "    system_prompt (str): The system prompt to guide the AI's behavior.\n",
    "    user_message (str): The user's message or query.\n",
    "    model (str): The model to be used for generating the response. Default is \"meta-llama/Llama-2-7B-chat-hf\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the AI model.\n",
    "    \"\"\"\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_message}\n",
    "        ]\n",
    "    )\n",
    "    return response\n",
    "\n",
    "# Create the user prompt based on the top chunks\n",
    "user_prompt = \"\\n\".join([f\"Header: {chunk['header']}\\nContent:\\n{chunk['text']}\" for chunk in top_chunks])\n",
    "user_prompt = f\"{user_prompt}\\nQuestion: {query}\"\n",
    "\n",
    "# Generate AI response\n",
    "ai_response = generate_response(system_prompt, user_prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the AI Response\n",
    "We compare the AI response with the expected answer and assign a score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluation Score: 0.5\n"
     ]
    }
   ],
   "source": [
    "# Define evaluation system prompt\n",
    "evaluate_system_prompt = \"\"\"You are an intelligent evaluation system. \n",
    "Assess the AI assistant's response based on the provided context. \n",
    "- Assign a score of 1 if the response is very close to the true answer. \n",
    "- Assign a score of 0.5 if the response is partially correct. \n",
    "- Assign a score of 0 if the response is incorrect.\n",
    "Return only the score (0, 0.5, or 1).\"\"\"\n",
    "\n",
    "# Extract the ground truth answer from validation data\n",
    "true_answer = data[0]['ideal_answer']\n",
    "\n",
    "# Construct evaluation prompt\n",
    "evaluation_prompt = f\"\"\"\n",
    "User Query: {query}\n",
    "AI Response: {ai_response}\n",
    "True Answer: {true_answer}\n",
    "{evaluate_system_prompt}\n",
    "\"\"\"\n",
    "\n",
    "# Generate evaluation score\n",
    "evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)\n",
    "\n",
    "# Print the evaluation score\n",
    "print(\"Evaluation Score:\", evaluation_response.choices[0].message.content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-new-specific-rag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
